We will be talking about job salaries, with a focus on coding-oriented jobs. A salary depends on several factors that have to be taken into account:
and even more. All of this gets represented by a single "number" (we will be using annual USD). It is not easy to split a salary into parts that would allow a more objective measurement and comparison with others.
Thanks to Stack Overflow and its community: every year they release a survey about the tech stack respondents use, where they work from, and how much they earn. This information can be obtained directly from Stack Overflow in an anonymized fashion: stackoverflow.com.
Historical data is also available, to look for trends and for the points in time when new technologies, frameworks, or languages arise. The data can be downloaded free of charge under the Open Database License. It is allowed to:
as long as the produced result is made public again. Since I am working in a public git repository, this should meet the requirements.
Job salary is always a big topic on business platforms such as LinkedIn and the popular XING (in the German-speaking region). For both services you have to pay a monthly subscription to get the information you are interested in.
Salary was and always will be a strong argument for searching for or changing a job. On the other hand, not many people want to share this information, as they prefer not to reveal their own salary without getting information in exchange.
I want to provide an easy-to-use web application that can predict a salary based on the top 15 features. The application takes the features as input via an HTML form and returns the prediction without making any further API calls:
All potentially private information stays in the browser, so the user does not have to worry about their data. I also want to keep the memory requirements for training low enough to allow local training on practically any modern device with 4 GB of RAM. We will use TensorFlow (the Keras API) for training and TensorFlow.js for prediction, embedded in a static HTML5 web app.
When data is available in a public space, it should be as easy as possible for everyone to access. I want to make the data easily available to everyone, so we will build an app where each feature can be passed in and we get a prediction that we can compare to the salary we or others earn.
As we are dealing with numeric predictions within a range, this is a classic regression problem. We will use the Root Mean Squared Error, a common metric for this kind of problem. Why RMSE:
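As a reminder, RMSE is the square root of the mean of the squared prediction errors. It is reported in the same unit as the target (annual USD here) and penalizes large errors more heavily than small ones. A minimal sketch:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: sqrt of the mean squared prediction error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# A prediction that is off by 10,000 USD on every sample has an RMSE of 10,000:
print(rmse([50_000, 60_000], [40_000, 50_000]))  # 10000.0
```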
import pandas as pd
from config import config
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import os
import json
import numpy as np
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
plt.rcParams['figure.figsize'] = [7, 7]
df_raw = pd.read_csv(config[-1]['data_path'], dtype=str)
df = df_raw[~ df_raw['ConvertedComp'].isnull()]
float(len(df)) / len(df_raw)
0.5391787282232667
len(df_raw)
64461
The Stack Overflow data contains a lot of columns (61 in total). Most of them hold a string out of a set of possible answers; only a small number contain floating-point values. In total we have 64,461 responses, of which about 53.9% (34,756) contain an answer about the current job salary, fewer than for the job satisfaction question we analysed at the beginning of this year (70%).
df_raw.columns
Index(['Respondent', 'MainBranch', 'Hobbyist', 'Age', 'Age1stCode', 'CompFreq',
'CompTotal', 'ConvertedComp', 'Country', 'CurrencyDesc',
'CurrencySymbol', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
'DevType', 'EdLevel', 'Employment', 'Ethnicity', 'Gender', 'JobFactors',
'JobSat', 'JobSeek', 'LanguageDesireNextYear', 'LanguageWorkedWith',
'MiscTechDesireNextYear', 'MiscTechWorkedWith',
'NEWCollabToolsDesireNextYear', 'NEWCollabToolsWorkedWith', 'NEWDevOps',
'NEWDevOpsImpt', 'NEWEdImpt', 'NEWJobHunt', 'NEWJobHuntResearch',
'NEWLearn', 'NEWOffTopic', 'NEWOnboardGood', 'NEWOtherComms',
'NEWOvertime', 'NEWPurchaseResearch', 'NEWPurpleLink', 'NEWSOSites',
'NEWStuck', 'OpSys', 'OrgSize', 'PlatformDesireNextYear',
'PlatformWorkedWith', 'PurchaseWhat', 'Sexuality', 'SOAccount',
'SOComm', 'SOPartFreq', 'SOVisitFreq', 'SurveyEase', 'SurveyLength',
'Trans', 'UndergradMajor', 'WebframeDesireNextYear',
'WebframeWorkedWith', 'WelcomeChange', 'WorkWeekHrs', 'YearsCode',
'YearsCodePro'],
dtype='object')
Anonymized identifier of a survey response. Unique for each row in the dataset, numeric. Not helpful for our use case, so it can be removed.
An exclusive answer indicating whether developing is the respondent's main job. Possible values are:
set(df_raw['MainBranch'])
{'I am a developer by profession',
'I am a student who is learning to code',
'I am not primarily a developer, but I write code sometimes as part of my work',
'I code primarily as a hobby',
'I used to be a developer by profession, but no longer am',
nan}
df_raw['MainBranch'].describe()
count 64162 unique 5 top I am a developer by profession freq 47193 Name: MainBranch, dtype: object
Most work as professional developers.
If they code as a hobby.
df_raw['Hobbyist'].describe()
count 64416 unique 2 top Yes freq 50388 Name: Hobbyist, dtype: object
set(df_raw['Hobbyist'])
{'No', 'Yes', nan}
Most code in their free time as well; some don't give an answer.
The age the respondent had at the survey time.
df_raw['Age'].describe()
count 45446 unique 110 top 25 freq 2693 Name: Age, dtype: object
set(df_raw['Age'])
{'1',
'10',
'11',
'12',
'13',
'14',
'14.5',
'14.7',
'15',
'15.5',
'16',
'16.5',
'17',
'18',
'19',
'19.8',
'20',
'20.5',
'21',
'21.5',
'22',
'22.5',
'23',
'23.4',
'23.8',
'24',
'24.5',
'25',
'26',
'26.5',
'26.8',
'27',
'27.5',
'279',
'28',
'29',
'3',
'30',
'31',
'31.5',
'32',
'32.5',
'32.8',
'33',
'34',
'34.5',
'35',
'35.7',
'36',
'37',
'38',
'39',
'39.5',
'40',
'41',
'42',
'43',
'44',
'45',
'46',
'47',
'48',
'49',
'49.5',
'5',
'50',
'51',
'52',
'53',
'54',
'55',
'56',
'57',
'58',
'59',
'60',
'61',
'62',
'63',
'64',
'65',
'66',
'67',
'68',
'69',
'7',
'70',
'71',
'72',
'73',
'74',
'75',
'76',
'77',
'78',
'79',
'80',
'81',
'83',
'84',
'85',
'86',
'88',
'89',
'94',
'95',
'96',
'97',
'98',
'99',
nan}
There seems to be no validation of the input data for the Age column. We need to remove such answers in the following steps, as I do not accept an age of 99 for a professional developer. We have NaNs here as well.
The age the respondent had when they started to code.
df_raw['Age1stCode'].describe()
count 57900 unique 63 top 15 freq 5618 Name: Age1stCode, dtype: object
set(df_raw['Age1stCode'])
{'10',
'11',
'12',
'13',
'14',
'15',
'16',
'17',
'18',
'19',
'20',
'21',
'22',
'23',
'24',
'25',
'26',
'27',
'28',
'29',
'30',
'31',
'32',
'33',
'34',
'35',
'36',
'37',
'38',
'39',
'40',
'41',
'42',
'43',
'44',
'45',
'46',
'47',
'48',
'49',
'5',
'50',
'51',
'52',
'53',
'54',
'55',
'56',
'58',
'6',
'63',
'64',
'68',
'7',
'72',
'73',
'76',
'8',
'83',
'85',
'9',
'Older than 85',
'Younger than 5 years',
nan}
We have NaNs here again. Besides numeric values, two non-numeric options could also be selected: "Older than 85" and "Younger than 5 years".
How often did they receive the salary? In Germany a monthly payout is common, but the contract states an annual salary. In other parts of the world this looks different.
df_raw['CompFreq'].describe()
count 40069 unique 3 top Yearly freq 19818 Name: CompFreq, dtype: object
set(df_raw['CompFreq'])
{'Monthly', 'Weekly', 'Yearly', nan}
Yearly is the most popular option. We have NaNs here as well.
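The ConvertedComp column normalizes these different frequencies into a single annual USD figure. A rough sketch of the idea (the factors 12 and 52 are my assumption, not necessarily Stack Overflow's exact methodology):

```python
# Assumed conversion factors; Stack Overflow's exact methodology may differ.
FACTOR = {"Yearly": 1, "Monthly": 12, "Weekly": 52}

def annualize(amount: float, freq: str) -> float:
    """Convert a compensation value to an annual figure based on its frequency."""
    return amount * FACTOR[freq]

print(annualize(5_000, "Monthly"))  # 60000
```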
Where the respondent is living.
df_raw['Country'].describe()
count 64072 unique 183 top United States freq 12469 Name: Country, dtype: object
set(df_raw['Country'])
{'Afghanistan',
'Albania',
'Algeria',
'Andorra',
'Angola',
'Antigua and Barbuda',
'Argentina',
'Armenia',
'Australia',
'Austria',
'Azerbaijan',
'Bahamas',
'Bahrain',
'Bangladesh',
'Barbados',
'Belarus',
'Belgium',
'Belize',
'Benin',
'Bhutan',
'Bolivia',
'Bosnia and Herzegovina',
'Botswana',
'Brazil',
'Brunei Darussalam',
'Bulgaria',
'Burkina Faso',
'Burundi',
'Cambodia',
'Cameroon',
'Canada',
'Cape Verde',
'Chad',
'Chile',
'China',
'Colombia',
'Congo, Republic of the...',
'Costa Rica',
'Croatia',
'Cuba',
'Cyprus',
'Czech Republic',
"Côte d'Ivoire",
'Democratic Republic of the Congo',
'Denmark',
'Djibouti',
'Dominican Republic',
'Ecuador',
'Egypt',
'El Salvador',
'Eritrea',
'Estonia',
'Ethiopia',
'Fiji',
'Finland',
'France',
'Gabon',
'Gambia',
'Georgia',
'Germany',
'Ghana',
'Greece',
'Grenada',
'Guatemala',
'Guinea',
'Guyana',
'Haiti',
'Honduras',
'Hong Kong (S.A.R.)',
'Hungary',
'Iceland',
'India',
'Indonesia',
'Iran',
'Iraq',
'Ireland',
'Isle of Man',
'Israel',
'Italy',
'Jamaica',
'Japan',
'Jordan',
'Kazakhstan',
'Kenya',
'Kiribati',
'Kosovo',
'Kuwait',
'Kyrgyzstan',
"Lao People's Democratic Republic",
'Latvia',
'Lebanon',
'Lesotho',
'Libyan Arab Jamahiriya',
'Liechtenstein',
'Lithuania',
'Luxembourg',
'Madagascar',
'Malawi',
'Malaysia',
'Maldives',
'Mali',
'Malta',
'Marshall Islands',
'Mauritania',
'Mauritius',
'Mexico',
'Micronesia, Federated States of...',
'Monaco',
'Mongolia',
'Montenegro',
'Morocco',
'Mozambique',
'Myanmar',
'Namibia',
'Nauru',
'Nepal',
'Netherlands',
'New Zealand',
'Nicaragua',
'Niger',
'Nigeria',
'Nomadic',
'North Korea',
'Norway',
'Oman',
'Pakistan',
'Panama',
'Papua New Guinea',
'Paraguay',
'Peru',
'Philippines',
'Poland',
'Portugal',
'Qatar',
'Republic of Korea',
'Republic of Moldova',
'Romania',
'Russian Federation',
'Rwanda',
'Saint Lucia',
'Saint Vincent and the Grenadines',
'San Marino',
'Saudi Arabia',
'Senegal',
'Serbia',
'Sierra Leone',
'Singapore',
'Slovakia',
'Slovenia',
'Solomon Islands',
'Somalia',
'South Africa',
'South Korea',
'Spain',
'Sri Lanka',
'Sudan',
'Swaziland',
'Sweden',
'Switzerland',
'Syrian Arab Republic',
'Taiwan',
'Tajikistan',
'Thailand',
'The former Yugoslav Republic of Macedonia',
'Timor-Leste',
'Togo',
'Trinidad and Tobago',
'Tunisia',
'Turkey',
'Turkmenistan',
'Uganda',
'Ukraine',
'United Arab Emirates',
'United Kingdom',
'United Republic of Tanzania',
'United States',
'Uruguay',
'Uzbekistan',
'Venezuela, Bolivarian Republic of...',
'Viet Nam',
'Yemen',
'Zambia',
'Zimbabwe',
nan}
Most answers were from the USA (almost 20%).
The currency that is used in the country. Stack Overflow used this to convert the salary into USD based on the exchange rates of a particular day.
Databases that are currently used or will be used in the next year.
df_raw['DatabaseWorkedWith'].describe()
count 49537 unique 2808 top MySQL freq 3827 Name: DatabaseWorkedWith, dtype: object
set(df_raw['DatabaseWorkedWith'][0:1])
{'Elasticsearch;Microsoft SQL Server;Oracle'}
This is the first non-exclusive answer: from a given list of options, the respondent could select multiple ones. This is encoded in the source file by separating the selections with semicolons in a single field. This can create a unique combination per respondent, so it has to be converted in the following steps to reduce the feature space.
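One way to break such combinations apart is pandas' `str.get_dummies`, which splits on the separator and one-hot encodes each individual option. A sketch with a hypothetical mini sample:

```python
import pandas as pd

# Hypothetical mini sample mimicking the semicolon-separated multi-select field.
df = pd.DataFrame({"DatabaseWorkedWith": [
    "Elasticsearch;Microsoft SQL Server;Oracle",
    "MySQL",
    None,
]})

# Each unique option becomes its own indicator column;
# the NaN row simply gets zeros everywhere.
one_hot = df["DatabaseWorkedWith"].str.get_dummies(sep=";")
print(list(one_hot.columns))
# ['Elasticsearch', 'Microsoft SQL Server', 'MySQL', 'Oracle']
```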
The type of developer the respondent is. Multiple answers allowed.
df_raw['DevType'].describe()
count 49370 unique 8269 top Developer, full-stack freq 4424 Name: DevType, dtype: object
set(df_raw['DevType'][0:1])
{'Developer, desktop or enterprise applications;Developer, full-stack'}
Highest education level the respondent attained; only one answer allowed.
df_raw['EdLevel'].describe()
count 57431 unique 9 top Bachelor’s degree (B.A., B.S., B.Eng., etc.) freq 26542 Name: EdLevel, dtype: object
set(df_raw['EdLevel'])
{'Associate degree (A.A., A.S., etc.)',
'Bachelor’s degree (B.A., B.S., B.Eng., etc.)',
'I never completed any formal education',
'Master’s degree (M.A., M.S., M.Eng., MBA, etc.)',
'Other doctoral degree (Ph.D., Ed.D., etc.)',
'Primary/elementary school',
'Professional degree (JD, MD, etc.)',
'Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)',
'Some college/university study without earning a degree',
nan}
Almost half of the users have a Bachelor's degree.
What their current employment status is.
df_raw['Employment'].describe()
count 63854 unique 7 top Employed full-time freq 45270 Name: Employment, dtype: object
set(df_raw['Employment'])
{'Employed full-time',
'Employed part-time',
'Independent contractor, freelancer, or self-employed',
'Not employed, and not looking for work',
'Not employed, but looking for work',
'Retired',
'Student',
nan}
Most are employed on a full-time basis.
Ethnic background of the respondent.
df_raw['Ethnicity'].describe()
count 45948 unique 208 top White or of European descent freq 29318 Name: Ethnicity, dtype: object
set(df_raw['Ethnicity'][0:100])
{'Black or of African descent',
'East Asian',
'Hispanic or Latino/a/x',
'Hispanic or Latino/a/x;White or of European descent',
'Indigenous (such as Native American, Pacific Islander, or Indigenous Australian)',
'Middle Eastern',
'Multiracial',
'South Asian',
'White or of European descent',
'White or of European descent;Indigenous (such as Native American, Pacific Islander, or Indigenous Australian)',
nan}
Multiple options could be selected. Most (64%) are White or of European descent.
The gender of the respondent. Multiple answers possible.
df_raw['Gender'].describe()
count 50557 unique 7 top Man freq 46013 Name: Gender, dtype: object
set(df_raw['Gender'][0:100])
{'Man', 'Man;Non-binary, genderqueer, or gender non-conforming', 'Woman', nan}
Almost all participants were men (91%).
What are the important factors for choosing a job?
df_raw['JobFactors'].describe()
count 49349 unique 230 top Flex time or a flexible schedule;Languages, fr... freq 2777 Name: JobFactors, dtype: object
set(df_raw['JobFactors'][0:10])
{'Diversity of the company or organization;Languages, frameworks, and other technologies I’d be working with;Office environment or company culture',
'Diversity of the company or organization;Remote work options;Office environment or company culture',
'Flex time or a flexible schedule;How widely used or impactful my work output would be;Opportunities for professional development',
'Flex time or a flexible schedule;Office environment or company culture;Opportunities for professional development',
'Languages, frameworks, and other technologies I’d be working with;Remote work options;Opportunities for professional development',
'Remote work options;Opportunities for professional development;Family friendliness',
nan}
Multiple options were available. The top answer combination got 2,777 responses.
Job satisfaction. Only one option could be selected.
df_raw['JobSat'].describe()
count 45194 unique 5 top Very satisfied freq 14611 Name: JobSat, dtype: object
set(df_raw['JobSat'])
{'Neither satisfied nor dissatisfied',
'Slightly dissatisfied',
'Slightly satisfied',
'Very dissatisfied',
'Very satisfied',
nan}
Most are very satisfied with their job (32%).
If they are currently searching for a new job.
df_raw['JobSeek'].describe()
count 51727 unique 3 top I’m not actively looking, but I am open to new... freq 29776 Name: JobSeek, dtype: object
set(df_raw['JobSeek'])
{'I am actively looking for a job',
'I am not interested in new job opportunities',
'I’m not actively looking, but I am open to new opportunities',
nan}
Most are open to new opportunities but are not actively looking (58%).
Programming languages the respondents work with or want to work with. Multiple answers possible.
df_raw['LanguageWorkedWith'].describe()
count 57378 unique 14256 top HTML/CSS;JavaScript;PHP;SQL freq 1073 Name: LanguageWorkedWith, dtype: object
df_raw['LanguageDesireNextYear'].describe()
count 54113 unique 16243 top Python freq 1152 Name: LanguageDesireNextYear, dtype: object
The top combination here is HTML/CSS, JavaScript, PHP, and SQL. The top value for next year is Python.
Different tools the respondents work with or want to work with. Multiple answers possible.
df_raw['MiscTechWorkedWith'].describe()
count 40314 unique 2730 top Node.js freq 7151 Name: MiscTechWorkedWith, dtype: object
df_raw['MiscTechDesireNextYear'].describe()
count 42379 unique 5216 top Node.js freq 4134 Name: MiscTechDesireNextYear, dtype: object
Most have used Node.js as a tool or want to start using it in the next year.
A list of collaboration tools; the most frequent response is the GitHub platform.
df_raw['NEWCollabToolsWorkedWith'].describe()
count 52883 unique 1153 top Github freq 4343 Name: NEWCollabToolsWorkedWith, dtype: object
Does your company have a dedicated DevOps person?
df_raw['NEWDevOps'].describe()
count 42686 unique 3 top Yes freq 18712 Name: NEWDevOps, dtype: object
Most companies have a dedicated DevOps person.
The following columns are less important, but for completeness I list them here. The descriptions were extracted from the zip file provided by Stack Overflow itself.
How important is the practice of DevOps to scaling software development?
How important is a formal education, such as a university degree in computer science, to your career?
In general, what drives you to look for a new job? Select all that apply.
When job searching, how do you learn more about a company? Select all that apply.
How frequently do you learn a new language or framework?
Do you think Stack Overflow should relax restrictions on what is considered “off-topic”?
Do you think your company has a good onboarding process? (By onboarding, we mean the structured process of getting you settled in to your new role at a company)
Are you a member of any other online developer communities?
How often do you work overtime or beyond the formal time expectation of your job?
When buying a new tool or software, how do you discover and research available solutions? Select all that apply
You search for a coding solution online and the first result link is purple because you already visited it. How do you feel?
Which of the following Stack Overflow sites have you visited? Select all that apply.
What do you do when you get stuck on a problem? Select all that apply.
What is the primary operating system in which you work?
Approximately how many people are employed by the company or organization you currently work for?
Which platforms have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the platform and want to continue to do so, please check both boxes in that row.)
What level of influence do you, personally, have over new technology purchases at your organization?
Which of the following describe you, if any? Please check all that apply. If you prefer not to answer, you may leave this question blank.
Do you have a Stack Overflow account?
Do you consider yourself a member of the Stack Overflow community?
How frequently would you say you participate in Q&A on Stack Overflow? By participate we mean ask, answer, vote for, or comment on questions.
How frequently would you say you visit Stack Overflow?
How easy or difficult was this survey to complete?
How do you feel about the length of the survey this year?
Are you transgender?
What was your primary field of study?
Which web frameworks have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the framework and want to continue to do so, please check both boxes in that row.)
Compared to last year, how welcome do you feel on Stack Overflow?
On average, how many hours per week do you work? Please enter a whole number in the box
Including any education, how many years have you been coding in total?
df_raw['YearsCode'].describe()
count 57684 unique 52 top 10 freq 4510 Name: YearsCode, dtype: object
Most have 10 years of experience.
NOT including education, how many years have you coded professionally (as a part of your work)?
df_raw['YearsCodePro'].describe()
count 46349 unique 52 top 3 freq 4815 Name: YearsCodePro, dtype: object
Most have 3 years of professional coding experience.
In this chapter we will look a bit more closely at the distribution of the values.
def load(column: str, year: int) -> pd.DataFrame:
    '''
    Load the precalculated features for a column from the cache.
    :param column: (str) The column with the data to process
    :param year: (int) the year of the data, used as cache key
    :returns: pd.DataFrame with the cached feature data
    '''
    cache_key = f"{year}_{column}.json"
    cache_path = os.path.join('data', 'features', cache_key)
    if os.path.exists(cache_path):
        return pd.read_json(cache_path, lines=True)
    # the features are produced by the preprocessing pipeline;
    # fail loudly instead of silently returning None on a cache miss
    raise FileNotFoundError(cache_path)
def plot_distribution(column: str, year: int, sort_by: str = 'values') -> go.Figure:
    '''
    Plot the sum of the positives of a particular column with multilabel values.
    :param column: (str) The column with the data to plot
    :param year: (int) the year of the data
    :param sort_by: (str) 'values' for a sorted bar chart, 'numeric' for a histogram
    :returns: go.Figure, ready-to-use plotly figure
    '''
    df_cached = load(column, year)
    if "CompTotal" in df_cached.columns:
        df_cached = df_cached.drop("CompTotal", axis=1)
    if "Respondent" in df_cached.columns:
        df_cached = df_cached.drop("Respondent", axis=1)
    if sort_by == 'values':
        df_sum = df_cached.sum().to_frame()
        df_sorted = df_sum.sort_values(by=0, ascending=True)
        y = list(map(lambda x: x.split('_')[-1], df_sorted.index))
        x = df_sorted.values.flatten()
        fig = go.Figure(data=[go.Bar(x=x, y=y, text=y, orientation='h')])
    elif sort_by == 'numeric':
        col = df_cached.columns
        real_col = list(filter(lambda x: 'NA' not in x, col))[0]
        na_col = list(filter(lambda x: 'NA' in x, col))[0]
        x = df_cached[df_cached[na_col].isnull()][real_col]
        fig = go.Figure(data=[go.Histogram(x=x, nbinsx=100)])
    else:
        raise ValueError(f"unknown sort_by: {sort_by}")
    return fig
def plot_numeric(df: pd.DataFrame, column: str) -> go.Figure:
    # drop NaNs, cast the string column to float, and plot a histogram
    x = df[~ df[column].isnull()][column].astype("float")
    fig = go.Figure(data=[go.Histogram(x=x, nbinsx=100)])
    return fig
plot_numeric(df_raw, 'Age').show()
The age distribution looks surprising. On the one hand we have a heavy outlier with a value of 279, which is impossible with current medical equipment. But all values higher than 65 seem a bit unrealistic as well, because at that age people may retire and stop "coding".
I would remove all values more than 3 standard deviations from the mean in a preprocessing step. This applies to the really low values as well.
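A sketch of that 3-sigma rule (the data is illustrative; this is not the exact code of the preprocessing step):

```python
import pandas as pd

def reject_outliers(s: pd.Series, k: float = 3.0) -> pd.Series:
    """Keep only values within mean +/- k standard deviations."""
    s = pd.to_numeric(s, errors="coerce").dropna()  # the survey is read as str
    lo, hi = s.mean() - k * s.std(), s.mean() + k * s.std()
    return s[(s >= lo) & (s <= hi)]

# 279 inflates the std, but with enough regular samples it still lands
# outside the 3-sigma band and gets dropped:
ages = pd.Series(["25", "26", "27", "28", "29", "30", "31",
                  "25", "26", "27", "28", "29", "30", "31", "279", None])
print(reject_outliers(ages).max())  # 31.0
```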
We have to remove / convert the already mentioned categorical values from the data in order to plot it.
query = np.logical_or(df_raw['YearsCode'] == 'Less than 1 year', df_raw['YearsCode'] == 'More than 50 years')
plot_numeric(df_raw[~ query], 'YearsCode').show()
This looks realistic. It is interesting that many picked a round number (10, 20, 30, 40); maybe they don't remember more precisely.
query = np.logical_or(df_raw['YearsCodePro'] == 'Less than 1 year', df_raw['YearsCodePro'] == 'More than 50 years')
plot_numeric(df_raw[~ query], 'YearsCodePro').show()
With a peak at 3 years of professional coding experience, it looks like Stack Overflow is visited more by junior developers.
Several preprocessing steps were implemented. We will go into detail here on what each one does and why it was required. They are explained in order of execution. Splitting the preprocessing into several files allowed me to isolate problems while coding and to focus on solving one task per step. Most columns can be NaN; that needed a lot of attention.
This was the first preprocessing step. It explodes the list of possible answers into one-hot encoded columns for non-numeric answers. For example, the column LanguageWorkedWith allowed multiple answers. Whenever we found a semicolon, we extracted those values into their own columns. This resulted in a very wide table; I was not able to train the model on this representation (even loading it in pandas failed). NaNs were encoded as their own column.
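For exclusive answers the same idea applies without the semicolon split: `pd.get_dummies` with `dummy_na=True` gives NaN its own indicator column, as described above (hypothetical mini sample):

```python
import pandas as pd

# Hypothetical mini sample of an exclusive answer column.
s = pd.Series(["Yes", "No", None], name="Hobbyist")

# dummy_na=True adds a dedicated column marking missing answers.
one_hot = pd.get_dummies(s, dummy_na=True)
print(list(one_hot.columns))
```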
After I completed all steps once, the loss exploded to inf instead of decreasing. This was fixed by adding scaling to all columns (in particular the numeric values), including the target value. This executable calculates the min and max per column. However, it is not the absolute min and max; instead it is the value I consider the lowest plausible (3 standard deviations below the mean) or the highest plausible. After exploding, a one-hot column can contain only the value 1; this case is handled specially so it does not mess up the following steps.
This step reads the data again and marks all Respondent ids that should be removed from the dataset because they violate the min/max constraints.
Now we finally scale the data by dividing by the previously calculated max value (we ignore the min value). This step skips rows marked by reject.py, indirectly removing them from the dataset.
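Putting the two steps together, a simplified sketch of the bound calculation and the division (the function names are mine; the real scripts also handle the one-hot edge case):

```python
import pandas as pd

def fit_max(s: pd.Series, k: float = 3.0) -> float:
    """Upper bound per column: mean + k*std instead of the absolute max."""
    s = pd.to_numeric(s, errors="coerce").dropna()
    return float(s.mean() + k * s.std())

def scale(s: pd.Series, col_max: float) -> pd.Series:
    """Divide by the bound so typical values land roughly in [0, 1]."""
    return pd.to_numeric(s, errors="coerce") / col_max

salary = pd.Series([40_000.0, 60_000.0, 80_000.0])
bound = fit_max(salary)  # 60_000 + 3 * 20_000 = 120_000
print(scale(salary, bound).round(3).tolist())  # [0.333, 0.5, 0.667]
```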
After implementing the full preprocessing, I still was not able to train on the data within the memory restrictions. That is why I cheated a bit and trained on what I call a feature group. What is a feature group? For each column, I put the target label and its one-hot encoded columns into one file and trained on it in Keras.
As this was much lighter (in number of columns), it worked very well. The model itself is as simple as possible. As I don't have that many samples, adding more layers would not improve it much but would increase the inference time on the prediction side later; performance wins for my use case. The relevant part from shared/train.py:
def build_and_compile_model():
    model = keras.Sequential([
        layers.Dense(1)
    ])
    model.compile(loss=tf.keras.losses.MeanSquaredError(),
                  optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  metrics=[tf.keras.metrics.MeanSquaredError(),
                           tf.keras.metrics.MeanAbsoluteError()])
    return model
and the fit call itself:
model = build_and_compile_model()
history = model.fit(X_train.values, y_train.values,
                    verbose=1,
                    epochs=epochs, batch_size=1,
                    validation_data=(X_test, y_test))
return model, history, list(df.columns)
After running all models, I had to pick the best ones. I decided to combine the best 15 models that were each trained on only one feature group. I consider this number a good trade-off between the time needed to enter the data for a prediction and the quality of the model. A second positive aspect: it makes sure everything fits into memory. The 15 models were selected by their RMSE score: the 15 lowest ones were used in train_all.py.
Finally, the best selected features are combined to create the final model. We do this by splitting the best features into 5 folds and training on each fold with a classic 80/20 split. The model with the lowest RMSE in any epoch is selected as the winner. All folds share a low RMSE score except the first one, where we may have been unlucky with the data or the test split (extracted from stdout):
| Fold | val RMSE Epoch 1 | val RMSE Epoch 2 | val RMSE Epoch 3 |
|---|---|---|---|
| 0 | 125.12760162353516 | 132.7395782470703 | 135.8756103515625 |
| 1 | 0.0006983008934184909 | 0.0006387992762029171 | 0.000556076702196151 |
| 2 | 0.009237580001354218 | 0.008935610763728619 | 0.00845410581678152 |
| 3 | 0.00038578390376642346 | 0.0002545823808759451 | 0.0001253084046766162 |
| 4 | 0.003603751305490732 | 0.0033430070616304874 | 0.0029510962776839733 |
Our best epoch got an RMSE score of 0.0001, which is really good. Because of these results, I did not do any further hyperparameter search.
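The fold procedure described above can be sketched without any framework code: shuffle the row indices once, cut them into five equal folds, and give each fold its own 80/20 train/validation split (a dependency-free stand-in for e.g. sklearn's KFold):

```python
import numpy as np

def five_fold_indices(n: int, k: int = 5, seed: int = 0):
    """Shuffle the row indices once, then cut them into k folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

folds = five_fold_indices(100)
print([len(f) for f in folds])  # [20, 20, 20, 20, 20]

# Within each fold, a classic 80/20 split separates train and validation rows:
train, val = folds[0][:16], folds[0][16:]
print(len(train), len(val))  # 16 4
```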
Now the serving part is missing. I decided to embed this Keras model in a static website using TensorFlow.js. This executable generates the entire HTML template, which can then be opened and used locally without any further API calls, preventing any data from leaving the browser.
evaluation = []
for f in os.listdir(os.path.join("data", "metrics")):
with open(os.path.join("data", "metrics", f), 'rt') as metric_reader:
data = json.loads(metric_reader.read())
data['file'] = f
evaluation.append(data)
pd.DataFrame(evaluation).sort_values('val_mean_squared_error')[['file', 'val_mean_squared_error', ]][0:16]
| | file | val_mean_squared_error |
|---|---|---|
| 28 | 2020_Country.json | [0.009920666925609112] |
| 19 | 2020_CurrencyDesc.json | [0.010503113269805908] |
| 6 | 2020_CurrencySymbol.json | [0.010630987584590912] |
| 53 | 2020_YearsCode.json | [0.010961346328258514] |
| 5 | 2020_YearsCodePro.json | [0.012053127400577068] |
| 29 | 2020_WebframeWorkedWith.json | [0.012106635607779026] |
| 9 | 2020_CompFreq.json | [0.012175439856946468] |
| 44 | 2020_WelcomeChange.json | [0.012430550530552864] |
| 39 | 2020_NEWSOSites.json | [0.01255364716053009] |
| 27 | 2020_Ethnicity.json | [0.01259028259664774] |
| 7 | 2020_JobSeek.json | [0.012596573680639267] |
| 11 | 2020_Age.json | [0.012640145607292652] |
| 2 | 2020_Sexuality.json | [0.012656370177865028] |
| 33 | 2020_OpSys.json | [0.01281856931746006] |
| 31 | 2020_JobSat.json | [0.012859459035098553] |
| 52 | 2020_NEWCollabToolsWorkedWith.json | [0.012873232364654541] |
As you can see from the output, no individual model outperforms the combined one, which has a val_mean_squared_error that is orders of magnitude lower. Combining them gives very good performance and robust results.
Let's compare these results with the final model:
pd.read_json(os.path.join("data", "meta", "metrics.json"))
| | loss | mean_squared_error | mean_absolute_error | val_loss | val_mean_squared_error | val_mean_absolute_error |
|---|---|---|---|---|---|---|
| 0 | 4280.350098 | 4280.350098 | 10.097062 | 0.000386 | 0.000386 | 0.019641 |
| 1 | 0.106147 | 0.106147 | 0.261078 | 0.000255 | 0.000255 | 0.015956 |
| 2 | 0.078890 | 0.078890 | 0.221760 | 0.000125 | 0.000125 | 0.011194 |
Over the three epochs, the validation MSE drops from 0.000386 to 0.000125 and the validation MAE from 0.0196 to 0.0112. As the model was barely improving anymore, I canceled the calculation after the third epoch.
The model did very well in predicting the job salary.
We performed several preprocessing steps to get the data into a format that can be "understood" by ML: exploding the columns, rejecting outliers, and scaling the data. To reduce memory usage we trained on one feature group at a time and combined the best 15 models. They are served in the (local) browser to provide a safe environment to get predictions from.
Job salary is always a hot topic. The more transparently we handle it, the better for everybody involved. A neutral source and this model can help to get a better feeling for a salary. Challenging was the inf loss during training before scaling was implemented; it took me quite a long time to figure it out.
In the data fetching step I also added survey data from other years (besides 2020), but did not use it in my solution; it may be worth trying to merge the different survey years to have more data points available. As the solution is highly automated, more manual work could be added to validate the values of the columns. For example, the max salary after scaling is still very high (500k USD), which in my opinion is out of bounds for most developers I know. Maybe the US market is different.
If you want to run the entire pipeline locally, just use the all.sh script in the project root directory. For more details on installation, please see README.md. It will do everything you need (it even downloads the data); finally you can serve the generated HTML template using the built-in Python webserver (as it is static HTML/JS), which the script starts as well. It took about 30 minutes to run everything.
You can get the Code from the following github repository: https://github.com/dariusgm/stackoverflowcapstone